Focused Crawling: A Means to Acquire Biological Data from the Web

Author

Ari Pirkola

Abstract

Experience paper. The World Wide Web contains billions of publicly available documents (pages), and it grows and changes rapidly. Web search engines, such as Google and Altavista, provide access to indexable Web documents. An important part of a search engine is a Web crawler, whose function is to collect Web pages for the search engine. Due to the Web's immense size and dynamic nature, no crawler is able to cover the entire Web and keep up with all the changes. This fact has pushed the development of focused crawlers. In contrast to the crawlers used by general search engines, focused crawlers download Web documents selectively, restricting the scope of crawling to a predefined domain. The downloaded documents can be stored and used as a source for data mining. In this paper we describe the main features of focused crawling, discuss the research on focused crawling conducted by the author's research group, and discuss problem areas of focused crawling that our work so far has revealed and that are not discussed in the literature. Our test data consisted of Web documents in the genomics domain.
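The selective downloading described in the abstract can be illustrated with a minimal sketch. It is not the crawler used by the author's group: the toy in-memory "web", the keyword list, the relevance measure, and the threshold are all assumptions made for illustration. The sketch keeps a priority queue of unvisited URLs, fetches the most promising one first, and follows links only from pages judged to be on topic.

```python
import heapq

# Hypothetical in-memory "web" used instead of real HTTP requests:
# URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("genomics gene sequence database", ["b", "c"]),
    "b": ("gene expression and protein data", ["d"]),
    "c": ("football scores and news", ["e"]),
    "d": ("genome annotation of gene families", []),
    "e": ("celebrity gossip", []),
}

# Assumed topic vocabulary for a genomics-focused crawl.
TOPIC_TERMS = {"gene", "genome", "genomics", "protein", "sequence"}


def relevance(text):
    """Fraction of words in the text that are topic terms (a crude stand-in
    for a real text classifier)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in TOPIC_TERMS for w in words) / len(words)


def focused_crawl(seeds, fetch, threshold=0.2, max_pages=100):
    """Best-first crawl: links inherit the score of the page they were found on."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue  # unknown URL in the toy web
        fetched += 1
        text, links = page
        score = relevance(text)
        if score < threshold:
            continue  # off-topic page: neither stored nor expanded
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return collected


result = focused_crawl(["a"], TOY_WEB.get)
print(result)
```

On this toy graph the crawl stores pages a, b, and d; page c is fetched but rejected as off topic, so its link to page e is never followed. A general-purpose crawler would have downloaded all five pages.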

Related resources

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Unsupervised Relation Extraction of In-Domain Data from Focused Crawls

This thesis proposal approaches unsupervised relation extraction from web data, which is collected by crawling only those parts of the web that are from the same domain as a relatively small reference corpus. The first part of this proposal is concerned with the efficient discovery of web documents for a particular domain and in a particular language. We create a combined, focused web crawling ...

Design and Implementation of Focused Web Crawler Using Genetic Algorithm: An Approach to Web Mining

The speed at which the World Wide Web (WWW) is growing around the clock spreads its arms from smaller collections of web pages to a massive hub of web information, which gradually increases the complexity of the crawling process. Search engines handle enormous numbers of queries from different parts of the universe to retrieve most of the relevant results in response to the user queries, and it is solely depe...

Language Specific and Topic Focused Web Crawling

We describe an experiment on collecting large language and topic specific corpora automatically by using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the cr...

Ontology Driven Focused Crawling of Web Documents

With the dynamism of the World Wide Web in recent years, the issue of discovering relevant web pages has become an important challenge. A focused crawler aims at selectively seeking out pages that are relevant to a pre-defined set of topics. Most of the current approaches perform syntactic matching, that is, they retrieve documents that contain particular keywords from the user's query. This often leads t...

Publication date: 2007
